CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Deriving reliable region-word alignment from image-text pairs is critical to
learn object-level vision-language representations for open-vocabulary object
detection. Existing methods typically rely on pre-trained or self-trained
vision-language models for alignment, which are prone to limitations in
localization accuracy or generalization capabilities. In this paper, we propose
CoDet, a novel approach that overcomes the reliance on pre-aligned
vision-language space by reformulating region-word alignment as a co-occurring
object discovery problem. Intuitively, by grouping images that mention a shared
concept in their captions, objects corresponding to the shared concept shall
exhibit high co-occurrence among the group. CoDet then leverages visual
similarities to discover the co-occurring objects and align them with the
shared concept. Extensive experiments demonstrate that CoDet has superior
performance and compelling scalability in open-vocabulary detection, e.g., by
scaling up the visual backbone, CoDet achieves 37.0 and
44.7 on OV-LVIS, surpassing the previous SoTA by 4.2
and 9.8. Code is available at
https://github.com/CVMI-Lab/CoDet.
Comment: Accepted by NeurIPS 202
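The co-occurrence idea in the abstract above can be illustrated with a toy sketch. It is not the CoDet implementation: the function name `discover_cooccurring_regions`, the use of cosine similarity, and the max-then-sum scoring rule are all assumptions made for illustration. Given region features from a group of images whose captions share a concept, it picks, in each image, the region that best matches some region in every other image.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each feature vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def discover_cooccurring_regions(group_feats):
    """Hypothetical helper: group_feats is a list of (num_regions_i, dim)
    arrays, one per image in a group sharing a caption concept.
    Returns, per image, the index of the region that co-occurs most
    strongly across the rest of the group."""
    feats = [l2_normalize(f) for f in group_feats]
    picks = []
    for i, query in enumerate(feats):
        score = np.zeros(query.shape[0])
        for j, support in enumerate(feats):
            if i == j:
                continue
            sim = query @ support.T       # cosine similarities, shape (R_i, R_j)
            score += sim.max(axis=1)      # best match for each query region
        picks.append(int(score.argmax()))
    return picks
```

The selected regions would then be aligned with the shared concept word; how that alignment is trained is not specified in the abstract and is left out here.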
Exploring Transformers for Open-world Instance Segmentation
Open-world instance segmentation is a rising task, which aims to segment all
objects in the image by learning from a limited number of base-category
objects. This task is challenging, as the number of unseen categories could be
hundreds of times larger than that of seen categories. Recently, DETR-like
models have been extensively studied in the closed world but remain unexplored
in the open world. In this paper, we utilize the Transformer for open-world
instance segmentation and present SWORD. Firstly, we propose attaching a
stop-gradient operation before the classification head and further add IoU heads
for discovering novel objects. We demonstrate that a simple stop-gradient
operation not only prevents the novel objects from being suppressed as
background, but also allows the network to enjoy the merit of heuristic label
assignment. Secondly, we propose a novel contrastive learning framework to
enlarge the representation gap between objects and background. Specifically, we
maintain a universal object queue to obtain the object center, and dynamically
select positive and negative samples from the object queries for contrastive
learning. While previous works focus only on pursuing average recall and
neglect average precision, we show the strength of SWORD by considering
both criteria. Our models achieve state-of-the-art performance
in various open-world cross-category and cross-dataset generalizations.
In particular, in the VOC to non-VOC setup, our method sets new state-of-the-art
results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization,
SWORD significantly outperforms the previous best open-world model by 5.9% on
APm and 8.1% on ARm100.
Comment: Accepted by ICCV2023. 16 page
EGC: Image Generation and Classification via a Diffusion Energy-Based Model
Learning image classification and image generation using the same set of
network parameters is a challenging problem. Recent advanced approaches that perform
well in one task often exhibit poor performance in the other. This work
introduces an energy-based classifier and generator, namely EGC, which can
achieve superior performance in both tasks using a single neural network.
Unlike a conventional classifier that outputs a label given an image (i.e., a
conditional distribution $p(y \mid x)$), the forward pass in EGC is a
classifier that outputs a joint distribution $p(x, y)$, enabling an
image generator in its backward pass by marginalizing out the label $y$. This
is done by estimating the energy and classification probability given a noisy
image in the forward pass, while denoising it using the score function
estimated in the backward pass. EGC achieves competitive generation results
compared with state-of-the-art approaches on ImageNet-1k, CelebA-HQ and LSUN
Church, while achieving superior classification accuracy and robustness against
adversarial attacks on CIFAR-10. This work represents the first successful
attempt to simultaneously excel in both tasks using a single set of network
parameters. We believe that EGC bridges the gap between discriminative and
generative learning.
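The forward/backward relationship described above can be sketched on a toy model. This is not the EGC network: it assumes the common convention that classifier logits define an unnormalized joint density, so that marginalizing out the label gives an energy E(x) = -logsumexp(logits), and it uses a linear classifier (hypothetical parameters `W`, `b`) so the gradient can be written analytically instead of by backpropagation. The sketch also ignores the noise conditioning that a diffusion model would need.

```python
import numpy as np

def logsumexp(z):
    # Numerically stable log(sum(exp(z))).
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def forward(W, b, x):
    """Forward pass: energy of x and class probabilities p(y | x)."""
    logits = W @ x + b
    lse = logsumexp(logits)
    energy = -lse                  # label marginalized out of the joint
    probs = np.exp(logits - lse)   # softmax over labels
    return energy, probs

def score(W, b, x):
    """'Backward pass': gradient of log p(x) with respect to x, i.e.
    minus the gradient of the energy. For linear logits this equals
    W^T softmax(Wx + b)."""
    _, probs = forward(W, b, x)
    return W.T @ probs
```

In a real network the score would be obtained by differentiating the energy through the model; the closed form here only holds for the linear toy case.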
Recognize Any Regions
Understanding the semantics of individual regions or patches within
unconstrained images, such as in open-world object detection, represents a
critical yet challenging task in computer vision. Building on the success of
powerful image-level vision-language (ViL) foundation models like CLIP, recent
efforts have sought to harness their capabilities by either training a
contrastive model from scratch with an extensive collection of region-label
pairs or aligning the outputs of a detection model with image-level
representations of region proposals. Despite notable progress, these approaches
are plagued by computationally intensive training requirements, susceptibility
to data noise, and deficiency in contextual information. To address these
limitations, we explore the synergistic potential of off-the-shelf foundation
models, leveraging their respective strengths in localization and semantics. We
introduce a novel, generic, and efficient region recognition architecture,
named RegionSpot, designed to integrate position-aware localization knowledge
from a localization foundation model (e.g., SAM) with semantic information
extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge
while minimizing training overhead, we keep both foundation models frozen,
focusing optimization efforts solely on a lightweight attention-based knowledge
integration module. Through extensive experiments in the context of open-world
object recognition, our RegionSpot demonstrates significant performance
improvements over prior alternatives, while also providing substantial
computational savings. For instance, trained with 3 million data samples in
a single day using 8 V100 GPUs, our model outperforms GLIP by 6.5% in mean
average precision (mAP), with an even larger margin of 14.8% for more
challenging and rare categories.
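As a rough illustration of an attention-based integration module of the kind described above, the sketch below lets region tokens (standing in for outputs of a frozen localization model) attend to a feature map (standing in for outputs of a frozen ViL model). The function name `integrate`, the single-head design, and the projection matrices `Wq`, `Wk`, `Wv` are assumptions for illustration; the abstract does not specify the module's internals. Only the projections would be trainable; the two inputs are treated as fixed.

```python
import numpy as np

def softmax(z, axis=-1):
    # Stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def integrate(region_tokens, vil_features, Wq, Wk, Wv):
    """Single-head cross-attention (hypothetical design).
    region_tokens: (R, d_loc) tokens from the frozen localization model.
    vil_features:  (N, d_vil) features from the frozen ViL model.
    Returns fused region embeddings (R, d) and attention weights (R, N)."""
    q = region_tokens @ Wq                     # queries from regions
    k = vil_features @ Wk                      # keys from ViL features
    v = vil_features @ Wv                      # values from ViL features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn
```

The fused region embeddings could then be compared against text embeddings of category names for recognition.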
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Multimodal representation learning has shown promising improvements on
various vision-language tasks. Most existing methods excel at building
global-level alignment between vision and language while lacking effective
fine-grained image-text interaction. In this paper, we propose a jointly masked
multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both
implicit and explicit targets for the masked signals to recover. The implicit
target provides a unified and debiased objective for vision and language, where
the model predicts latent multimodal representations of the unmasked input. The
explicit target further enriches the multimodal representations by recovering
high-level and semantically meaningful information: momentum visual features of
image patches and concepts of word tokens. Through such a masked modeling
process, our model not only learns fine-grained multimodal interaction, but
also avoids the semantic gap between high-level representations and low- or
mid-level prediction targets (e.g. image pixels), thus producing semantically
rich multimodal representations that perform well on both zero-shot and
fine-tuned settings. Our pre-trained model (named MAMO) achieves
state-of-the-art performance on various downstream vision-language tasks,
including image-text retrieval, visual question answering, visual reasoning,
and weakly-supervised visual grounding.
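The joint masking step mentioned above can be sketched as follows. The function name `joint_mask` and the masking ratios are hypothetical; the abstract does not state how many patches or tokens are masked, nor the sampling strategy. The sketch only shows independent random masking of both modalities for the same image-text pair.

```python
import numpy as np

def joint_mask(num_patches, num_tokens, image_ratio, text_ratio, rng):
    """Sample boolean masks (True = masked) for image patches and word
    tokens of one image-text pair. Ratios are illustrative assumptions."""
    n_img = int(round(image_ratio * num_patches))
    n_txt = int(round(text_ratio * num_tokens))
    patch_mask = np.zeros(num_patches, dtype=bool)
    token_mask = np.zeros(num_tokens, dtype=bool)
    # Choose distinct positions to mask in each modality.
    patch_mask[rng.choice(num_patches, size=n_img, replace=False)] = True
    token_mask[rng.choice(num_tokens, size=n_txt, replace=False)] = True
    return patch_mask, token_mask
```

The model would then be asked to recover targets for the masked positions while a separate pass over the unmasked input supplies the latent targets.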
Slimmable Generative Adversarial Networks
Generative adversarial networks (GANs) have achieved remarkable progress in
recent years, but the continuously growing scale of models makes them
challenging to deploy widely in practical applications. In particular, for
real-time generation tasks, different devices require generators of different
sizes due to varying computing power. In this paper, we introduce slimmable
GANs (SlimGANs), which can flexibly switch the width of the generator to
accommodate various quality-efficiency trade-offs at runtime. Specifically, we
leverage multiple discriminators that share partial parameters to train the
slimmable generator. To facilitate the consistency between generators
of different widths, we present a stepwise inplace distillation technique that
encourages narrow generators to learn from wide ones. As for class-conditional
generation, we propose a sliceable conditional batch normalization that
incorporates the label information into different widths. Our methods are
validated, both quantitatively and qualitatively, by extensive experiments and
a detailed ablation study.
Comment: Accepted to AAAI 202
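A minimal sketch of width switching and stepwise distillation, under stated assumptions: the two-layer `generator`, the weight-slicing scheme, and the mean-squared-error distillation term are illustrative choices, not the architecture or loss from the paper. Narrow generators reuse the leading hidden units of the widest one, and each width is pulled toward the output of the next wider width.

```python
import numpy as np

def generator(z, params, width_mult):
    """Toy slimmable generator: only the first `hidden` units of the
    shared hidden layer are active at a given width multiplier."""
    W1, b1, W2, b2 = params
    hidden = max(1, int(W1.shape[0] * width_mult))
    h = np.maximum(z @ W1[:hidden].T + b1[:hidden], 0.0)   # sliced hidden layer
    return np.tanh(h @ W2[:, :hidden].T + b2)              # full-size output

def stepwise_distillation_loss(outputs):
    """outputs: generator outputs ordered narrow -> wide. Each narrower
    output is compared with the next wider one, which would be treated
    as a constant target during training (MSE is an assumption)."""
    loss = 0.0
    for narrow, wide in zip(outputs[:-1], outputs[1:]):
        loss += np.mean((narrow - wide) ** 2)
    return loss
```

At runtime, a device would simply call `generator` with the width multiplier that fits its compute budget.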